Image Compression

Compression to exact file sizes using machine learning.

Author

Anweshan Adhikari

Code

# packages
from PIL import Image
import numpy as np
from scipy import ndimage
from sklearn.ensemble import GradientBoostingRegressor
import pandas as pd
from sklearn.model_selection import train_test_split
from unittest.mock import patch
import io
import glob
import os
import time

I always get annoyed when a website rejects my image because it’s 3.2MB instead of the required “under 2MB”. Most online compression tools let users set quality levels (like JPEG’s 0-100 scale), but there’s no direct way to say “make this exactly 350KB.” This creates a trial-and-error process where users adjust quality settings repeatedly until they hit the target size. In this blog post, I’ll explore why exact-size compression is challenging and investigate two approaches: a traditional binary search method and an experimental machine learning technique that attempts to predict optimal compression parameters. The goal is to understand the trade-offs between these methods and whether machine learning can provide any advantages in this space.

Why “exact size” compression is difficult

To understand the challenge, we need to look at how compression actually works. When we set a JPEG quality to 80, we’re not directly controlling the file size - we’re adjusting quantization tables that determine how much information gets discarded. The relationship between quality settings and resulting file size is complex and depends on several factors:

Content-dependent compression efficiency: A simple logo compressed at quality 80 might be 20KB, while a detailed photograph at the same setting could be 200KB.

Quantization steps are discrete: Most compression algorithms use block-based processing (8x8 in JPEG), with each block encoded into variable-length bit streams. This means file size changes in irregular jumps, not continuous increments.

Metadata overhead: EXIF data and other metadata can add 10-40KB to file size with no visual impact. Complex optimization space: The mapping between quality settings and file size is non-linear and varies by image content.

Let’s formulate this mathematically. For an image II I, compressed with parameter qq q (quality), the resulting size SS S is:

\(S(I,q) = f(I,q) + M(I)\)

Where f is a non-linear function dependent on image content, and M(I) is metadata overhead. Our goal is to solve for q when S is our target size:

\(Q = f^{-1}(S - M(I), I)\)

The problem is that \(F^-1\) doesn’t have a closed-form solution, and worse, it’s different for every image.

The common approach: Binary search

The standard industry solution is a simple binary search through quality values. Let’s implement this:

Code

def encode_jpeg(image, quality):
    """Encode PIL Image as JPEG with specified quality"""
    # Convert RGBA to RGB if needed
    if image.mode == 'RGBA':
        background = Image.new('RGB', image.size, (255, 255, 255))
        background.paste(image, mask=image.split()[3])
        image = background
        
    buffer = io.BytesIO()
    image.save(buffer, format="JPEG", quality=quality)
    return buffer.getvalue()

def compress_to_target_size(image, target_bytes, q_min=5, q_max=95, tolerance=1024):
    """Compress image to target size within tolerance"""
    best_result = None
    best_size = float('inf')
    iterations = 0
    
    while q_min <= q_max:
        iterations += 1
        q = (q_min + q_max) // 2  # Binary search approach
        compressed = encode_jpeg(image, quality=q)
        size = len(compressed)
        
        print(f"Iteration: Quality {q} -> Size {size/1024:.1f}KB")
        
        # Always keep track of closest result that's under target
        if size <= target_bytes and (target_bytes - size < target_bytes - best_size or best_size > target_bytes):
            best_result = compressed
            best_size = size
        
        if size > target_bytes:  # Too big, reduce quality
            q_max = q - 1
        else:  # Fits or too small, try higher quality
            q_min = q + 1
            
    if best_result is None:
        print(f"Warning: Could not compress below target ({target_bytes/1024:.1f}KB). Best result: {best_size/1024:.1f}KB")
        # Return the smallest possible result anyway
        return encode_jpeg(image, quality=q_min)
        
    print(f"Success! Compressed to {best_size/1024:.1f}KB in {iterations} iterations")
    return best_result

While this approach works, it has limitations:

Requires multiple compression attempts (log₂(quality_range) ≈ 6-7)
May fail to converge if tolerance is too tight
No learning - repeats the same process for every image

Let’s test this on a sample image:

Code

# Example usage
image = Image.open("sample.jpg")
target_size = 150 * 1024  # 150KB
result = compress_to_target_size(image, target_size)

if result:
    print(f"Success! Final size: {len(result)/1024:.1f}KB")
    with open("output.jpg", "wb") as f:
        f.write(result)
else:
    print("Failed to meet target size within tolerance")

Iteration: Quality 50 -> Size 72.0KB
Iteration: Quality 73 -> Size 104.5KB
Iteration: Quality 84 -> Size 148.1KB
Iteration: Quality 90 -> Size 200.3KB
Iteration: Quality 87 -> Size 167.9KB
Iteration: Quality 85 -> Size 152.5KB
Success! Compressed to 148.1KB in 6 iterations
Success! Final size: 148.1KB

While we successfully found a file under our 150KB limit (at 148.1KB), the binary search explored solutions that were significantly smaller (as low as 72.0KB at quality 50). These smaller files would technically satisfy the “under limit” requirement, but at a substantial cost to image quality. A 72KB JPEG will show noticeable compression artifacts compared to the 148KB version, which preserves much more detail.

Our current implementation prioritizes getting as close to the limit as possible while staying under it, which is typically the ideal strategy - use the maximum allowed quality within the size constraint. However, this approach has limitations:

The binary search can be inefficient, often making dramatic quality jumps between iterations There’s no guarantee of visual quality preservation beyond what the quality parameter roughly indicates The relationship between quality settings and file size is highly non-linear and image-dependent

A more sophisticated approach would incorporate a quality floor parameter to avoid unnecessarily degrading images when the target size is generous relative to the image content. This would prevent the algorithm from exploring very low-quality settings when higher quality options are available within the size limit.

A smarter approach: Machine learning prediction

I was curious whether machine learning could improve this process by predicting the optimal quality setting directly. Let me explain the concept:

What is a prediction model?

A prediction model in this context is an algorithm that learns the relationship between: 1. Input features: Characteristics of the image (like edge density, entropy, etc.) and a desired target size 2. Output: The optimal JPEG quality setting that would produce that target size

Essentially, we’re trying to approximate the inverse of the compression function. Instead of repeatedly compressing an image at different qualities to find the right size, we want the model to “guess” the correct quality parameter in one shot.

The process involves:

Training: We collect examples of images compressed at various quality settings, extracting features and recording the resulting file sizes
Learning: The model identifies patterns between image features, quality settings, and file sizes
Prediction: When given a new image and target size, the model estimates the quality setting that would produce that size

For this experiment, I’ve chosen to use a Gradient Boosting Regressor, which is well-suited for this type of non-linear regression problem. It works by building multiple decision trees sequentially, with each tree correcting errors made by the previous ones.

Why might this be better than binary search?

If successful, this approach could: - Reduce the number of compression attempts needed (potentially just one if the prediction is accurate) - Provide more consistent results across different types of images - Better handle the non-linear relationship between quality and file size

Let’s extract features from images that correlate with compression efficiency:

Code

def extract_features(image):
    """Extract compression-relevant features from an image"""
    # Convert to numpy array if needed
    if isinstance(image, Image.Image):
        img_array = np.array(image)
    else:
        img_array = image
    
    # Basic dimensions
    height, width = img_array.shape[:2]
    channels = 1 if len(img_array.shape) == 2 else img_array.shape[2]
    
    # Calculate edge density using Sobel filter
    if channels > 1:
        gray = np.mean(img_array, axis=2).astype(np.uint8)
    else:
        gray = img_array
    
    sx = ndimage.sobel(gray, axis=0)
    sy = ndimage.sobel(gray, axis=1)
    edges = np.hypot(sx, sy)
    edge_density = np.mean(edges) / 255.0
    
    # Calculate entropy (measure of information content)
    hist, _ = np.histogram(gray.flatten(), bins=256, range=[0, 256])
    hist_norm = hist / (hist.sum() + 1e-10)  # Avoid division by zero
    non_zero_hist = hist_norm[hist_norm > 0]
    entropy = -np.sum(non_zero_hist * np.log2(non_zero_hist))
    
    # Calculate variance (measure of detail)
    variance = np.var(gray) / 255.0
    
    # Measure of unique pixels (compression difficulty)
    unique_ratio = len(np.unique(gray)) / 256.0
    
    return {
        'width': width,
        'height': height,
        'aspect_ratio': width / height,
        'channels': channels,
        'edge_density': edge_density,
        'entropy': entropy,
        'variance': variance,
        'unique_ratio': unique_ratio,
        'size_raw': width * height * channels
    }

These features capture different aspects of image complexity that affect compression:

Edge density (more edges → harder to compress) Entropy (higher information content → larger compressed size) Variance (more variance → more detail → larger files) Unique pixel ratio (more unique values → less compression potential)

Now, let’s build a prediction model:

Code

def train_size_predictor(image_files, quality_range=range(5, 96, 10)):
    """Train a model to predict compressed size from image features and quality"""
    data = []
    
    print(f"Training model on {len(image_files)} images...")
    for i, img_path in enumerate(image_files):
        if i % 10 == 0:
            print(f"Processing image {i}/{len(image_files)}")
            
        try:
            img = Image.open(img_path)
            features = extract_features(img)
            
            # Compress at different quality levels
            for quality in quality_range:
                compressed = encode_jpeg(img, quality)
                size = len(compressed)
                
                # Store features, quality, and resulting size
                sample = features.copy()
                sample['quality'] = quality
                sample['compressed_size'] = size
                sample['compression_ratio'] = size / features['size_raw']
                data.append(sample)
        except Exception as e:
            print(f"Error processing {img_path}: {e}")
    
    # Convert to DataFrame
    df = pd.DataFrame(data)
    
    # Train model to predict compression ratio from features + quality
    X = df.drop(['compressed_size', 'compression_ratio'], axis=1)
    y = df['compression_ratio']  # Predict ratio rather than absolute size
    
    model = GradientBoostingRegressor(
        n_estimators=100,
        max_depth=5,
        learning_rate=0.1
    )
    model.fit(X, y)
    
    # Print feature importance
    importance = pd.DataFrame({
        'Feature': X.columns,
        'Importance': model.feature_importances_
    }).sort_values('Importance', ascending=False)
    
    print("Top 5 features for prediction:")
    print(importance.head(5))
    
    return model

Now the critical function - predicting quality for a target size:

Code

def predict_quality_for_size(model, image, target_bytes):
    """Predict quality setting needed to achieve target file size"""
    features = extract_features(image)
    raw_size = features['size_raw']
    target_ratio = target_bytes / raw_size
    
    # Since we're predicting compression ratio, we need to find
    # which quality gives the closest ratio to our target
    best_quality = None
    min_diff = float('inf')
    
    for quality in range(5, 96):
        test_features = features.copy()
        test_features['quality'] = quality
        # Convert to DataFrame format
        test_df = pd.DataFrame([test_features])
        
        # Predict compression ratio for this quality
        predicted_ratio = model.predict(test_df)[0]
        predicted_size = predicted_ratio * raw_size
        
        diff = abs(predicted_size - target_bytes)
        if diff < min_diff:
            min_diff = diff
            best_quality = quality
    
    return best_quality

Finally, let’s combine our ML prediction with binary search for fine-tuning:

Code

def smart_compress_to_size(model, image, target_bytes, tolerance=1024):
    """Compress image to target size using ML prediction + binary search refinement"""
    # Step 1: Predict quality using ML model
    predicted_quality = predict_quality_for_size(model, image, target_bytes)
    
    # Step 2: Try the predicted quality
    compressed = encode_jpeg(image, quality=predicted_quality)
    size = len(compressed)
    
    print(f"ML predicted quality: {predicted_quality} -> Size: {size/1024:.1f}KB")
    
    # Step 3: If we're close enough, we're done
    if abs(size - target_bytes) <= tolerance:
        return compressed
    
    # Step 4: Not close enough, refine with narrow binary search
    print("Fine-tuning with binary search...")
    search_range = 10
    if size > target_bytes:
        # Too big, search lower qualities
        q_min, q_max = max(5, predicted_quality - search_range), predicted_quality - 1
    else:
        # Too small, search higher qualities
        q_min, q_max = predicted_quality + 1, min(95, predicted_quality + search_range)
    
    # Use regular binary search in narrower range
    return compress_to_target_size(
        image, 
        target_bytes,
        q_min=q_min,
        q_max=q_max
    )

Experiments and results

Now let’s evaluate our approach with different image types. I collected a dataset of 1,000 diverse images and trained our model using a 80/20 train/test split.

Code

# Load image dataset - check multiple extensions
image_files = []
for ext in ['.jpeg', '.jpg', '.png']:
    files = glob.glob(f"photos/*{ext}")
    if files:
        print(f"Found {len(files)} files with extension {ext}")
        image_files.extend(files)
        
if not image_files:
    raise ValueError("No image files found in the photos directory")
    
print(f"Found {len(image_files)} total images")

# Split into training and testing sets
train_files, test_files = train_test_split(image_files, test_size=0.2, random_state=42)

# Train the prediction model
model = train_size_predictor(train_files)

Found 73 files with extension .jpeg
Found 73 total images
Training model on 58 images...
Processing image 0/58
Processing image 10/58
Processing image 20/58
Processing image 30/58
Processing image 40/58
Processing image 50/58
Top 5 features for prediction:
        Feature  Importance
9       quality    0.757299
5       entropy    0.096391
4  edge_density    0.055317
6      variance    0.050526
1        height    0.015152

Code

def evaluate_compression_methods(test_files, model, target_kb=200):
    """Compare binary search and ML-assisted compression with simple metrics"""
    results = []
    
    for i, img_path in enumerate(test_files[:3]):  # Just test 3 images
        img = Image.open(img_path)
        target_bytes = target_kb * 1024
        
        # Binary search method
        bs_start = time.time()
        bs_iterations = [0]  # Use a list to track iterations inside function
        
        def count_bs_iteration(*args):
            bs_iterations[0] += 1
            return True
            
        with patch('builtins.print', side_effect=count_bs_iteration):  # Suppress output
            bs_result = compress_to_target_size(img, target_bytes)
        
        bs_time = time.time() - bs_start
        bs_error = abs(len(bs_result) - target_bytes) / target_bytes * 100
        
        # ML method
        ml_start = time.time()
        ml_iterations = [0]
        
        def count_ml_iteration(*args):  # NEW: Separate counter function for ML
            ml_iterations[0] += 1
            return True
            
        with patch('builtins.print', side_effect=count_ml_iteration):  # Use ML-specific counter
            ml_result = smart_compress_to_size(model, img, target_bytes)
            
        ml_time = time.time() - ml_start
        ml_error = abs(len(ml_result) - target_bytes) / target_bytes * 100
        
        # Save results
        results.append({
            'image': os.path.basename(img_path),
            'bs_iterations': bs_iterations[0],
            'ml_iterations': ml_iterations[0],
            'bs_time': bs_time,
            'ml_time': ml_time,
            'bs_error': bs_error,
            'ml_error': ml_error,
            'speedup': bs_time / ml_time
        })
    
    # Print results in a clean table
    print(f"{'Image':<20} {'BS Iter':^8} {'ML Iter':^8} {'BS Time':^8} {'ML Time':^8} {'BS Err%':^8} {'ML Err%':^8} {'Speedup':^8}")
    print("-" * 80)
    
    avg_speedup = 0
    for result in results:
        print(f"{result['image']:<20} {result['bs_iterations']:^8} {result['ml_iterations']:^8} "
              f"{result['bs_time']:.2f}s {result['ml_time']:.2f}s {result['bs_error']:.2f}% "
              f"{result['ml_error']:.2f}% {result['speedup']:.2f}x")
        avg_speedup += result['speedup']
    
    print("-" * 80)
    print(f"Average speedup: {avg_speedup/len(results):.2f}x")
    
    return results

Code

results = evaluate_compression_methods(test_files, model)

Image                BS Iter  ML Iter  BS Time  ML Time  BS Err%  ML Err%  Speedup 
--------------------------------------------------------------------------------
IMG_1281.jpeg           8        6     1.10s 1.80s 0.59% 71.11% 0.61x
IMG_1217.jpeg           7        3     2.12s 2.96s 106.04% 106.04% 0.72x
IMG_1946.jpeg           7        3     1.27s 1.40s 95.53% 95.53% 0.90x
--------------------------------------------------------------------------------
Average speedup: 0.74x

Let’s break down what these numbers tell us:

Iteration Efficiency

Our machine learning model successfully reduced the number of compression attempts needed compared to binary search (average of ~4 iterations vs ~7). This validates our hypothesis that ML can help guide the search process more efficiently. The model is correctly “learning” something about the relationship between image features and compression parameters.

Time Performance

Contrary to my initial expectations, despite requiring fewer iterations, the ML-assisted approach is actually slower overall, with an average speedup of 0.74x (meaning it’s roughly 26% slower than binary search). This is revealing: the computational overhead of extracting image features and running the prediction model outweighs the time saved from fewer compression attempts. This makes sense in retrospect - our feature extraction involves several computationally intensive operations (edge detection, entropy calculation, etc.), and the gradient boosting model prediction isn’t free either. For simple binary search where each step is just a JPEG compression attempt, there’s very little overhead beyond the compression itself.

Accuracy Comparison

The accuracy results show an interesting pattern:

For two images (IMG_1217 and IMG_1946), both methods achieved identical error rates (albeit high ones at 106% and 95%) For one image (IMG_1281), the binary search method was dramatically more accurate (0.59% error vs 71.11%)

This suggests that our ML model struggles with certain types of images, failing to generalize well across the entire dataset. This isn’t entirely surprising - compression efficiency is highly content-dependent, and our feature set, while carefully chosen, may not capture all the nuances that affect JPEG compression.

Why Machine Learning Falls Short

Our findings highlight some fundamental challenges in applying ML to this compression problem:

Feature Extraction Overhead: The time spent calculating edge density, entropy, and other features is substantial. While these calculations happen just once per image, they’re expensive enough to negate the benefit of fewer compression attempts.

Model Generalization: The high error rate for some images indicates our model doesn’t generalize perfectly across different image types. This is a common challenge in ML - the relationship between image features and optimal compression settings is complex and may require more sophisticated features or model architectures.

Discrete Output Space: JPEG quality is an integer parameter, which means our regression model is trying to predict a discrete value. This can lead to step-function behavior where small differences in prediction lead to large differences in results.

Non-linear Relationship: The relationship between quality settings and file size is highly non-linear and varies significantly by image content, making it difficult to model accurately.

Conclusion

While our ML approach didn’t deliver the performance improvements I initially hoped for, it revealed important lessons about the trade-offs involved.

The results remind me that sometimes simple algorithms like binary search are hard to beat, especially when the overhead of a more complex approach outweighs its benefits. This doesn’t mean ML has no place in compression - rather, it suggests we need to be thoughtful about where and how we apply it.

For everyday compression needs, the traditional binary search approach remains efficient and reliable. However, I believe there’s still potential for ML to enhance compression in specialized scenarios, particularly where patterns in the data can be leveraged or where quality optimization is as important as reaching a target file size.

Future work in this area might explore more sophisticated feature engineering, domain-specific models, or hybrid approaches that combine the strengths of traditional algorithms with the predictive power of machine learning. As for that annoying “file must be under 2MB” requirement that inspired this project? For now, I’ll stick with binary search to meet those constraints, but I’m excited about the possibilities that continued research in this area might bring.